MedOr: Order of Medians Based on 
Confidence Statements 

Carlos A. de B. Pereira Adriano Polpo * 
December 24, 2012 

Abstract 

Comparing two samples in relation to corresponding parameters 
of their respective populations is an old and classical statistical prob- 
lem. In this paper we present a simple yet effective tool to compare 
two samples through their medians. We calculate the confidence of 
the statement "the median of the first population is strictly smaller 
(larger) than the median of the second." We analyze two real data 
sets and empirically demonstrate the quality of the confidence for such a
statement. This confidence of the order of the medians is to be seen as
a pre-analysis tool that can provide useful insights for comparing two 
or more populations. The method is completely based on their exact 
distribution with no need for asymptotic considerations. We also pro- 
vide the MedOr statistical software, an R package that implements 
the ideas discussed in this work. 

Keywords: Significance test, comparison of two samples, confidence in- 
terval based on the binomial distribution. 
1 INTRODUCTION



This paper proposes an analysis that can be used as an aid for subsequent 
more complex statistical data analyses. We discuss ideas to compare two 
independent groups and to evaluate a measure that indicates which group 
has smaller (larger) values than the other one. They are simple and effective, 
without the need of sophisticated techniques. This work was motivated by 
the following example in oncology: preoperative Gleason scores in general 
provide valuable prognoses for cases with prostate cancer. However, the same 
quality is not verified for patients with a score of Gleason-7, because this 
group of patients is characterized by tumors displaying great morphological 
heterogeneity among affected regions. In order to search for a gene set that 
could distinguish between recurrent (R) and non-recurrent (NR) Gleason-7 
prostate cancer patients, microarray data have been collected. A possible 
important gene that is associated with this disease is the RPS28 gene. In the 
study, there are two samples: the first sample has m = 5 R patients and
the second sample has n = 20 NR patients. Table 1 lists the microarray
expression data for the 25 patients and an illustration is given in Figure 1.
As in many medical experiments, there are only a few cases in this study, and
most of them are non-recurrent.

Table 1: Expression of Gene RPS28 for Gleason-7 patients: Recurrent and
Non-recurrent Cases.

Recurrent (m = 5)
  14.8557 15.2209 15.3839 15.4106 15.4155

Non-recurrent (n = 20)
  14.9309 14.9535 15.1009 15.1622 15.4361
  15.4716 15.4932 15.5545 15.5584 15.5622
  15.5629 15.5741 15.5759 15.6101 15.6211
  15.6488 15.6638 15.6684 15.6966 15.6984



Suppose that the expression of a specific important gene is observed for
each patient of the two independent samples. Let the recurrent and
non-recurrent cases, with ordered samples (observations), be, respectively,
$(x_{(1)}, x_{(2)}, \ldots, x_{(m)})$ and $(y_{(1)}, y_{(2)}, \ldots, y_{(n)})$;
m and n are the sample sizes. The objective is to find genes that are under
(or over) expressed, which is sometimes expressed by the statement that an
expected microarray observation of an R case, x, is smaller (larger) than the
expected observation of an NR case, y. In other words, it is conjectured that,
for x and y being observations of random variables X and Y, one could expect,
for under (over) expressed situations, that the probability of $\{X < Y\}$ is
larger (smaller) than a specified value, for example 0.8 (0.2). One of the
statistical hypotheses that could indicate the validity of the conjecture is
$M_X < M_Y$ (with M used to indicate median and the subscripts used to
separate R and NR cases). Note that uppercase letters are used for random
variables and parameters and lowercase letters for observations: probabilities
refer to X and Y and confidence refers to x and y.

Figure 1: RPS28 Arrays for Gleason 7: Non-recurrent and Recurrent Cases.

In this work, we propose a measure to evaluate the confidence of the
statement $\{M_X < M_Y\}$ (and obviously of $\{M_X > M_Y\}$ as well). We call
this measure the confidence statement. The proposed confidence statement
was developed following the ideas of the nonparametric confidence interval
for a population's median based on the binomial distribution. The work is
organized as follows: in Section 2 we give a brief review of the confidence
interval for the population's median, and then we introduce the confidence
statement; in Section 3 we analyze two real data examples, discussing the
applicability of the procedure; in Section 4 we provide conclusions and final
remarks.

2 METHODS 

2.1 Confidence Intervals for Medians 

In this section we present the nonparametric confidence interval for a
population's median based on the binomial distribution. For additional details
we refer to Thompson (1936) and David and Nagaraja (2003, Chap. 7).



An event related to a random variable X is represented by A, while $M_X$
is the median of X. $\Pr(A \mid M_X)$ indicates the probability of the event A
when $M_X$ is known. In general, the median $M_X$ of a random variable X is a
population parameter that satisfies the following inequalities:

$$\Pr(X \le M_X \mid M_X) \ge \frac{1}{2} \quad\text{and}\quad \Pr(X \ge M_X \mid M_X) \ge \frac{1}{2}.$$

In the continuous case, these inequalities are tight:

$$\Pr(X \le M_X \mid M_X) = \frac{1}{2} \quad\text{and}\quad \Pr(X \ge M_X \mid M_X) = \frac{1}{2}.$$



Considering that $(X_1, X_2, \ldots, X_m)$ is a vector of m independent and
identically distributed random variables, we have that $(1/2)^m$ is the
probability of the event "all observations are smaller than $M_X$." Hence, the
probability that at least $X_{(m)}$ (the sample maximum; the parentheses in the
subscript are used to indicate the order) is larger than $M_X$ is the
complementary probability $1 - (1/2)^m$. Define $X_{(i)}$ as the i-th order
statistic. One may consider the interval $(x_{(1)}, x_{(m)})$ as a confidence
interval for the median $M_X$, for which the value of the confidence is
obtained as follows: the probability that all observations are on one side of
$M_X$, right or left, is $2(1/2)^m = (1/2)^{m-1}$. Again, taking the
complement, one obtains the probability of the event
$\{X_{(1)} < M_X < X_{(m)}\}$ as $1 - (1/2)^{m-1}$.

After observing the sample, we write that the statement
$\{x_{(1)} < M_X < x_{(m)}\}$ has a confidence equal to $1 - (1/2)^{m-1}$. We
call the attention of the reader to the subtle difference between probability
and confidence, as presented in Pereira and Castilho (2009), which justifies
the use of distinct terminology. To clarify, before the observations are
obtained and by using the order statistics $X_{(1)}$ and $X_{(m)}$ (minimum
and maximum), we write the following expression:

$$\Pr(X_{(1)} < M_X < X_{(m)}) = 1 - \left(\frac{1}{2}\right)^{m-1}.$$



After observing the sample, $\{x_{(1)} < M_X < x_{(m)}\}$ is only a statement:
we do not know the value of $M_X$ but we know the sample values of all
order statistics, $x_{(1)}, \ldots, x_{(m)}$. It can be said that one has a
confidence of $1 - (1/2)^{m-1}$ that the median is within the sample extreme
values: in this case, there are no probabilities anymore. Using the sample of
recurrent cases in Table 1 and as $1 - (1/2)^4 = 0.9375$, we could say with
confidence 93.75% that the interval (14.856, 15.416) contains the population's
median value. Also, as $1 - (1/2)^5 = 0.96875$, we are confident that
$\{M_X < 15.416\}$, with confidence value 96.88%. To be more formal, prior to
observations, we use the notation $\Pr(M_X < X_{(5)} \mid M_X) = 0.96875$.
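The arithmetic for the recurrent sample can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative assumptions and are not taken from the MedOr package, and exact fractions are used so that no rounding enters the computation.

```python
from fractions import Fraction

def extremes_confidence(m):
    """Confidence of {x_(1) < M_X < x_(m)} for a sample of size m."""
    return 1 - Fraction(1, 2) ** (m - 1)

def upper_bound_confidence(m):
    """Confidence of {M_X < x_(m)} for a sample of size m."""
    return 1 - Fraction(1, 2) ** m

# Recurrent cases from Table 1.
recurrent = [14.8557, 15.2209, 15.3839, 15.4106, 15.4155]
m = len(recurrent)

# Interval given by the sample extremes and its confidence (15/16).
print(min(recurrent), max(recurrent), float(extremes_confidence(m)))
# Confidence that the median lies below the sample maximum (31/32).
print(float(upper_bound_confidence(m)))
```

With m = 5 the two values are 15/16 and 31/32, matching the 93.75% and 96.88% quoted in the text.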

As an analogy, one can think of the above method as equivalent to tossing
a coin m times, computing the probability of zero successes, which is
$(1/2)^m$, and taking its complement, $1 - (1/2)^m$. The same arguments can
be used to obtain the probability of having at least two observations on one
side of $M_X$. The event $\{X_{(m-1)} > M_X\}$ happens if neither
$\{M_X > X_{(m)}\}$ nor $\{M_X > X_{(m-1)}\}$ occurs. Conditional on $M_X$
being known, the probability of $\{M_X > X_{(m)} \text{ or } M_X > X_{(m-1)}\}$
is $(1/2)^m + m(1/2)^m = (m+1)(1/2)^m$: one outcome has all m observations
below $M_X$ and m outcomes have exactly one observation above it. Hence,
$\Pr(X_{(m-1)} > M_X \mid M_X) = 1 - (m+1)(1/2)^m$. Consequently, the
confidence of the interval $(x_{(2)}, x_{(m-1)})$ is $1 - (m+1)(1/2)^{m-1}$.
For instance, considering m = 8, we obtain the confidence values for the
statements $\{x_{(m-1)} > M_X\}$ and $\{x_{(m-1)} > M_X > x_{(2)}\}$, which are
equal to 0.96484375 and 0.9296875, respectively. Extending now to any order
statistic, we can simply think of the number of successes in m tosses of a
fair coin.

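Letting i and j be indices in the set {1, 2, ..., m}, the events
$\{X_{(i)} < M_X\}$ and $\{X_{(j)} > M_X\}$ are those in which we are
interested. For i < j and by using the same arguments of the previous
discussion, we have the following probabilities, where k counts the
observations that fall below $M_X$:

$$\Pr(X_{(i)} < M_X \mid M_X) = \sum_{k=i}^{m} \binom{m}{k} \left(\frac{1}{2}\right)^{m},$$

$$\Pr(X_{(j)} > M_X \mid M_X) = \sum_{k=0}^{j-1} \binom{m}{k} \left(\frac{1}{2}\right)^{m}.$$

To obtain the confidence of the interval $(x_{(i)}, x_{(j)})$, the same argument of
tossing a fair coin is used. We then obtain the following:

$$\Pr(X_{(i)} < M_X < X_{(j)} \mid M_X) = \sum_{k=i}^{j-1} \binom{m}{k} \left(\frac{1}{2}\right)^{m}.$$

For m = 15, we have 0.982421875 and 0.96484375 as the confidence values
for the statements $\{x_{(12)} > M_X\}$ and $\{x_{(4)} < M_X < x_{(12)}\}$, respectively.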
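The coin-tossing argument translates directly into binomial sums. The sketch below is an illustration under our own naming (the three helper functions are hypothetical, not the MedOr interface); it counts the number of observations falling below the median, which follows a Binomial(m, 1/2) distribution in the continuous case, and reproduces the values quoted for m = 8 and m = 15.

```python
from math import comb

def prob_below(m, i):
    """Pr(X_(i) < M_X): at least i of the m observations fall below the median."""
    return sum(comb(m, k) for k in range(i, m + 1)) / 2 ** m

def prob_above(m, j):
    """Pr(X_(j) > M_X): at most j - 1 of the m observations fall below the median."""
    return sum(comb(m, k) for k in range(0, j)) / 2 ** m

def interval_confidence(m, i, j):
    """Pr(X_(i) < M_X < X_(j)): between i and j - 1 observations fall below the median."""
    if not 1 <= i < j <= m:
        raise ValueError("indices must satisfy 1 <= i < j <= m")
    return sum(comb(m, k) for k in range(i, j)) / 2 ** m

print(prob_above(15, 12))              # statement {x_(12) > M_X}, m = 15
print(interval_confidence(15, 4, 12))  # statement {x_(4) < M_X < x_(12)}, m = 15
print(prob_above(8, 7))                # statement {x_(7) > M_X}, m = 8
print(interval_confidence(8, 2, 7))    # statement {x_(2) < M_X < x_(7)}, m = 8
```

Because the denominators are powers of two, the floating-point results are exact here.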
To illustrate the confidence interval, we generate a sample with n = 10
from a normal distribution with mean 0 and variance 1. The generated data
are

x = (-1.6293, -0.7927, -0.5913, -0.3776, -0.2004,
     -0.1904, -0.0777, 0.0099, 0.1771, 2.5159).

We are interested in the interval with 95% confidence. Our procedure is
based on a discrete exact distribution and it will not attain an exact 95%
standard level (or any other level) but a close one: the larger the sample
size, the closer it will be. Our simulated data produce the intervals
(-1.6293, 0.0099) and (-0.7927, 0.1771) with, respectively, 94.43% and 97.85%
confidence. Since the second, although shorter, has a larger confidence,
we choose it as our confidence interval. Using now the standard method of
confidence interval and representing the mean and the standard error by
$\bar{x}$ and se, we obtain the 95.45% confidence interval as

$$(\bar{x} - 2\,se,\ \bar{x} + 2\,se) = (-0.1157 - 0.6695,\ -0.1157 + 0.6695) = (-0.7851,\ 0.5538).$$

The length of our 97.85% interval is 0.9698, smaller than 1.3390, which
is the length of the standard one based on Student's t distribution, with
95.45% confidence. Thus, we obtained a more confident, shorter interval.
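The numbers in this illustration can be recomputed from the listed data. The following sketch uses only the Python standard library; the variable and function names are our own choices for illustration. It evaluates the two candidate order-statistic intervals and the interval given by the mean plus or minus two standard errors.

```python
from math import comb, sqrt
from statistics import mean, stdev

# Simulated sample listed in the text, sorted so that x[i - 1] is x_(i).
x = sorted([-1.6293, -0.7927, -0.5913, -0.3776, -0.2004,
            -0.1904, -0.0777, 0.0099, 0.1771, 2.5159])
m = len(x)

def interval_confidence(m, i, j):
    """Confidence of {x_(i) < M_X < x_(j)} from the Binomial(m, 1/2) distribution."""
    return sum(comb(m, k) for k in range(i, j)) / 2 ** m

# Candidate intervals (x_(1), x_(8)) and (x_(2), x_(9)).
candidates = {}
for i, j in [(1, 8), (2, 9)]:
    lower, upper = x[i - 1], x[j - 1]
    candidates[(i, j)] = (lower, upper, interval_confidence(m, i, j))
    print(f"({lower}, {upper})  length={upper - lower:.4f}  "
          f"confidence={candidates[(i, j)][2]:.4f}")

# Interval given by the mean plus or minus two standard errors.
x_bar = mean(x)
se = stdev(x) / sqrt(m)
standard = (x_bar - 2 * se, x_bar + 2 * se)
print(f"({standard[0]:.4f}, {standard[1]:.4f})  "
      f"length={standard[1] - standard[0]:.4f}")
```

The order-statistic interval (x_(2), x_(9)) has length 0.9698, while the mean-based interval has length of about 1.339, in agreement with the comparison made in the text.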

2.2 Confidence Statement on the Order of Medians 

Returning to the problem of two samples that are used to compare two
subpopulations, say case and control, the goal is to analyze
the statement that the population median $M_X$ of X is smaller (larger) than
the population median $M_Y$ of Y: one of the statements $\{M_X < M_Y\}$ or
$\{M_X > M_Y\}$ is true. Recall that we use the notation
$(x_{(1)}, x_{(2)}, \ldots, x_{(m)})$ and $(y_{(1)}, y_{(2)}, \ldots, y_{(n)})$
for the ordered sample vectors. In fact, we have independent samples of
intra-sample independent and identically distributed observations.

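Suppose that there are observations $x_{(i)}$ and $y_{(j)}$ such that
$x_{(i)} < y_{(j)}$. We can write the following probabilities:

$$\Pr(X_{(i)} > M_X \mid M_X) = \sum_{k=0}^{i-1} \binom{m}{k} \left(\frac{1}{2}\right)^{m},$$

$$\Pr(Y_{(j)} < M_Y \mid M_Y) = \sum_{k=j}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n},$$

and then, since the two samples are independent, for the joint probability
one obtains

$$\Pr(M_X < X_{(i)} \;\&\; Y_{(j)} < M_Y \mid M_X, M_Y) = \left[\sum_{k=0}^{i-1} \binom{m}{k} \left(\frac{1}{2}\right)^{m}\right] \left[\sum_{k=j}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n}\right].$$

After observing that $x_{(i)} < y_{(j)}$ for the indices i and j, the confidence of
the statement $\{M_X < x_{(i)} < y_{(j)} < M_Y\}$ is equal to the right side of the
previous expression.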

We point out that we are looking for the shortest interval with high
confidence. Consequently, to evaluate the confidence of the statement
$\{M_X < M_Y\}$, we should look for the pair (i, j) with $x_{(i)} < y_{(j)}$
that produces the highest value of
$\Pr(M_X < X_{(i)} \;\&\; Y_{(j)} < M_Y \mid M_X, M_Y)$.
The consequence is that the statement $\{M_X < M_Y\}$ has a confidence equal
to

$$\sup_{i,j:\ x_{(i)} < y_{(j)}} \Pr(M_X < X_{(i)} \;\&\; Y_{(j)} < M_Y \mid M_X, M_Y).$$

The closer this value is to 1, the more confident we are about $M_X < M_Y$.
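The supremum over admissible pairs can be found by direct enumeration. The sketch below is our own illustration of that search, not the MedOr implementation: the function names are hypothetical, the two binomial tail sums are the ones described in this section, and ties between samples are excluded by the strict inequality.

```python
from math import comb

def prob_median_below(m, i):
    """Pr(M_X < X_(i)): at most i - 1 of m observations fall below the median."""
    return sum(comb(m, k) for k in range(0, i)) / 2 ** m

def prob_median_above(n, j):
    """Pr(Y_(j) < M_Y): at least j of n observations fall below the median."""
    return sum(comb(n, k) for k in range(j, n + 1)) / 2 ** n

def confidence_less(x, y):
    """Confidence of {M_X < M_Y}: best product over pairs with x_(i) < y_(j)."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if xs[i - 1] < ys[j - 1]:
                conf = prob_median_below(m, i) * prob_median_above(n, j)
                best = max(best, conf)
    return best

# Toy data (hypothetical): two fully separated samples of size 3.
print(confidence_less([1, 2, 3], [4, 5, 6]))  # best pair is i = 3, j = 1
print(confidence_less([4, 5, 6], [1, 2, 3]))  # no admissible pair
```

For the toy data the best pair gives (7/8)(7/8) = 0.765625, and the reversed comparison has no pair with $x_{(i)} < y_{(j)}$, so the function returns 0. The double loop is quadratic in the sample sizes; this is a design choice made for clarity rather than speed.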



3 EXAMPLES 



3.1 The Prostate Cancer 

In the example shown in Table 1, the statement $\{M_X < M_Y\}$ has a confidence
equal to

$$\left[\sum_{k=0}^{4} \binom{5}{k} \left(\frac{1}{2}\right)^{5}\right] \left[\sum_{k=5}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{20}\right] = 0.9688 \times 0.9941 = 0.9630.$$

This is a consequence of the fact that $15.416 = x_{(5)} < y_{(5)} = 15.436$ and
that

$$\Pr(M_X < X_{(5)} \mid M_X)\,\Pr(Y_{(5)} < M_Y \mid M_Y) = \Pr(M_X < X_{(5)} \;\&\; Y_{(5)} < M_Y \mid M_X, M_Y) = 0.963.$$

In other words, we are 96.3% confident about the statement $\{M_X < M_Y\}$.
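The two factors of this product can be recomputed from the data in Table 1. The following is a short check under our own variable names (an illustration only, not output of the MedOr package), using the pair i = 5, j = 5 identified in the text.

```python
from math import comb

# Data from Table 1.
recurrent = [14.8557, 15.2209, 15.3839, 15.4106, 15.4155]
non_recurrent = [14.9309, 14.9535, 15.1009, 15.1622, 15.4361,
                 15.4716, 15.4932, 15.5545, 15.5584, 15.5622,
                 15.5629, 15.5741, 15.5759, 15.6101, 15.6211,
                 15.6488, 15.6638, 15.6684, 15.6966, 15.6984]
m, n = len(recurrent), len(non_recurrent)

# Fifth order statistic of each sample (index 4 after sorting).
x5 = sorted(recurrent)[4]
y5 = sorted(non_recurrent)[4]
pair_is_admissible = x5 < y5

# Pr(M_X < X_(5)): at most 4 of the 5 observations fall below M_X.
p_x = sum(comb(m, k) for k in range(0, 5)) / 2 ** m
# Pr(Y_(5) < M_Y): at least 5 of the 20 observations fall below M_Y.
p_y = sum(comb(n, k) for k in range(5, n + 1)) / 2 ** n

confidence = p_x * p_y
print(pair_is_admissible, p_x, p_y, confidence)  # confidence is about 0.963
```

Here p_x equals 31/32 and p_y equals 1 - 6196/2^20, so the product agrees with the 96.3% reported above.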
3.2 The Schizophrenia Data Set 

The Schizophrenia data set is from the Altar A study of the Stanley Medical
Research Institute's online genomics database (SMRIDB) (The Stanley Medical
Research Institute, 2012; Higgs et al., 2006). The data have m = 32 patients
with schizophrenia and n = 34 individuals in the control group. A total of
20,993 microarray probes were reported. Our interest here is to find the most
differentially expressed genes. For the analysis, we evaluate both statements
$\{M_X < M_Y\}$ and $\{M_X > M_Y\}$, and keep the highest confidence in each
case. Table 2 presents the 10 transcripts with the highest confidence and
their respective statements. Performing this analysis on the whole data set
took about 15 minutes on a standard modern computer (Intel® Core™ i7 M620
2.67GHz).



Table 2: Schizophrenia data set: genes with largest confidence.

Transcripts  Confidence  Status  Median Order*
215003       0.99609     Under   M_S < M_C
208581       0.99521     Over    M_S > M_C
212854       0.99200     Over    M_S > M_C
216336       0.98681     Over    M_S > M_C
212294       0.98681     Over    M_S > M_C
213626       0.98549     Over    M_S > M_C
209847       0.98549     Under   M_S < M_C
208399       0.98549     Under   M_S < M_C
204326       0.98549     Over    M_S > M_C
221011       0.98439     Under   M_S < M_C

* M_S: median for schizophrenic patients; M_C: median for control individuals.

3.3 Discussion



In the prostate cancer example, it must be noticed that by using the
one-sided t-test one obtains a p-value of 7.24% (14.48% for the two-sided
test). This is used to test $H: \mu_X = \mu_Y$ versus $A: \mu_X < \mu_Y$
($A: \mu_X \ne \mu_Y$ for the two-sided test); $\mu$ here is the notation for
the mean, not for the median. Such a test has only asymptotic properties if
the distributions of X and Y are not normal. On the other hand, the present
paper proposes a method that does not impose any distributional restriction
and is exact and valid for any sample size.

The development of the present method builds on the studies from
Zellner et al. (2004) and Wasserman (2010). The ideas of conditional
statements came from Kiefer (1977). Simplicity and lack of barriers were our
main goals in building such a method. Without restrictions and by being
simple, a method might not be powerful. Some nonparametric methods,
for example in Noether (1991) and Wasserman (2006), do not directly use all the
ordered observations. They only use the order statistics $x_{(i)}$ and $y_{(j)}$ of each
group.

By using the equivalence of confidence statements and significance testing
(DeGroot, 1975), one could, without great distress, state the significance of
testing $H: M_X < M_Y$ versus $A: M_X > M_Y$ for the data in Table 1. We
are prone to say that the significance favoring H against A could be 96.3%.
Taking A as the null hypothesis instead, the exact
p-value favoring A would then be 3.7%. That is, under the standard policy,
we would reject the hypothesis of equality of medians and we would expect
gene RPS28 to be underexpressed for R patients when compared to the same
gene in the NR group.

In the schizophrenia example, we analyzed all 20,993 genes to find those
that were most differentially expressed. We found that among the 10 most 
differentially expressed transcripts, 4 were under and 6 were over expressed. 
Also, all confidence values were larger than 98%, which are good confidence 
levels in our opinion. 



4 CONCLUSIONS 

The intention of this work is to provide a method that can be employed as 
a first-step procedure whenever a data set is to be analyzed. The authors 
believe that this method can be used to eliminate those variables that have no
power to help in the discovery of differentially expressed transcripts, before 
conducting other more complex/specialized procedures. 

The reader will see that the method can be extended to more than two
groups. In order to do that, the confidence level to detect a strict order has
to be studied in more detail. The larger the number of sample groups, the
smaller is the expected confidence. This is because the product of numbers
belonging to the interval (0, 1) is smaller than any of its factors. For
instance, consider 3 random variables, X, Y and Z. The following inequality
is obvious:

$$\Pr(M_X < X_{(a)} \mid M_X)\,\Pr(Y_{(b)} < M_Y < Y_{(c)} \mid M_Y)\,\Pr(Z_{(d)} < M_Z \mid M_Z) < \min\{\Pr(M_X < X_{(a)} \mid M_X),\ \Pr(Y_{(b)} < M_Y < Y_{(c)} \mid M_Y),\ \Pr(Z_{(d)} < M_Z \mid M_Z)\}.$$

If the observed order statistics follow the inequality
$x_{(a)} < y_{(b)} < y_{(c)} < z_{(d)}$ (for orders a, b, c and d), then the
statement $\{M_X < M_Y < M_Z\}$ would
have a smaller confidence than the confidences obtained when comparing
a specific pair of the three medians. Hence, the confidence cut-off point to
induce decisions would have to decrease with the increasing number of groups
that are to be compared.
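A small numerical illustration of this effect is sketched below. The group sizes (10 observations each) and the orders a = 9, b = 2, c = 9, d = 2 are hypothetical values chosen only for the example, and the helper names are ours.

```python
from math import comb, prod

def prob_median_below(m, a):
    """Pr(M_X < X_(a)): at most a - 1 of m observations fall below the median."""
    return sum(comb(m, k) for k in range(0, a)) / 2 ** m

def prob_median_between(n, b, c):
    """Pr(Y_(b) < M_Y < Y_(c)): between b and c - 1 of n observations fall below."""
    return sum(comb(n, k) for k in range(b, c)) / 2 ** n

def prob_median_above(p, d):
    """Pr(Z_(d) < M_Z): at least d of p observations fall below the median."""
    return sum(comb(p, k) for k in range(d, p + 1)) / 2 ** p

# Hypothetical setting: three groups of size 10, orders a = 9, b = 2, c = 9, d = 2.
factors = [
    prob_median_below(10, 9),
    prob_median_between(10, 2, 9),
    prob_median_above(10, 2),
]
joint = prod(factors)

print(factors)  # each factor is at least 0.978
print(joint)    # the product falls below every single factor
```

In this setting each factor exceeds 97.8%, yet the confidence of the three-group statement drops to roughly 95.8%, which illustrates why the cut-off would need to be relaxed as groups are added.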

The procedure to evaluate the confidence statement is available in the
R package MedOr at http://code.google.com/p/medor/. The package is
distributed as an open source program under the GPLv3 license. We tested the
code by using samples of size 1,000 and with 4 different populations, and
obtained running times of less than 5 minutes. For the schizophrenia data
set, we evaluated 41,986 different statements for the two populations (sizes
m = 32 and n = 34) with a running time of less than 15 minutes.